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Good old on-line back-propagation for plain multi-layer perceptrons yields a very 
low 0.35% error rate on the famous MNIST handwritten digits benchmark. All we 
need to achieve this best result so far are many hidden layers, many neurons per layer, 
numerous deformed training images, and graphics cards to greatly speed up learning. 



Abstract 



1 Introduction 



Automatic handwriting recognition is of great academic and commercial interest. Cur- 
rent algorithms are already pretty good at learning to recognize handwritten digits. Post 

http://yann.lecun.com/exdb/mnist/ 



offices use them to sort letters; banks use them to read personal checks. MNIST (|LeCun 



et al. 1998 ) is the most widely used benchmark for isolated handwritten digit recogni- 



tion. More than a decade ago, artificial neural networks called Multilayer Perceptrons or 
MLPs ( |Werbos[ |1974[ |LeCun[ |1985[ |Rumelhart et al.[ |1986| ) were among the first clas- 
sifiers tested on MNIST. Most had few layers or few artificial neurons (units) per layer 



(LeCun et al. 1998), but apparently back then they were the biggest feasible MLPs, 
trained when CPU cores were at least 20 times slower than today. A more recent MLP 



with a single hidden layer of 800 units achieved 0.70% error (Simard et al. 20031. 
However, more complex methods listed on the MNIST web page always seemed to 
outperform MLPs, and the general trend went towards more and more complex variants 
of Support Vector Machines or SVMs ( Decoste & Scholkopfj 2002[ ) and combinations 



of NNs and SVMs (Lauer et al. , 2007) etc. Convolutional neural networks (CNNs) 



achieved a record-breaking 0.40% error rate (|Simard eta l. 2003), using novel elastic 
training image deformations. Recent methods pre-train each hidden CNN layer one by 
one in an unsupervised fashion (this seems promising especially for small training sets), 



then use supervised learning to achieve 0.39% error rate (Ranza to et al.[ |2006[ |2007| ). 
The biggest MLP so far ( |Sal akhutdi nov et aX] |2007| ) also was pre-trained without su- 
pervision then piped its output into another classifier to achieve an error of 1% without 
domain- specific knowledge. 

Are all these complexifications of plain MLPs really necessary? Can't one simply 
train really big plain MLPs on MNIST? Why is there no literature on this? One reason 
is that at first glance deep MLPs do not seem to work better than shallow networks 
(Ben gio et al.[|2006"] ). Training them is hard as back-propagated gradients quickly van- 
ish exponentially in the number of layers (Hochreiter 1991 , Hochreiter et al.[ 2001 



Hinton[ 2007[ ), just like in the first recurrent neural networks (Hochreiter & Schmidhu- 



ber, 1997). Indeed, previous deep networks successfully trained with back-propagation 



(BP) either had few free parameters due to weight- sharing (e.g. LeCu n et al.[ [1998; 



Simard et al. , 2003) or used unsupervised, layer- wise pre-training (e.g. Bengio et al. 



2006; Ranzato et al. 2006). But is it really true that deep BP-MLPs do not work at 



all, or do they just need more training time? How to test this? Unfortunately, on-line 
BP for hundreds/thousands of epochs on large MLPs may take weeks or months on 
standard serial computers. But can't one parallelize it? Well, on computer clusters this 
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is hard due to communication latencies between individual computers. Multi-threading 
on a multi-core processor is not easy either. We may speed up BP using SSE (Streaming 
Single Instruction, Multiple Data Extensions), either manually, or by setting appropriate 
compiler flags. The maximum theoretical speedup under single precision floating-point, 
however, is four, which is not enough. And MNIST is large - its 60,000 images take 
almost 50MB, too much to fit in the L2/L3 cache of any current processor. This requires 
to continually access data in considerably slower RAM. To summarize, currently it is 
next to impossible to train big MLPs on CPUs. 

We will show how to overcome all these problems by training large, deep MLPs on 
graphics cards. 



2 Data 

MNIST consists of two datasets, one for training (60,000 images) and one for testing 
(10,000 images). Many studies divide the training set into two sets consisting of 50,000 
images for training and 10,000 for validation. Our network is trained on slightly de- 
formed images, continually generated in on-line fashion; hence we may use the whole 
un-deformed training set for validation, without wasting training images. Pixel intensi- 
ties of the original gray scale images range from (background) to 255 (max foreground 
intensity). 28 x 28 = 784 pixels per image get mapped to real values pixel ™^ sit y - 1.0 
in [—1.0, 1.0], and are fed into the NN input layer. 



3 Architectures 

We train 5 MLPs with 2 to 9 hidden layers and varying numbers of hidden units. Mostly 
but not always the number of hidden units per layer decreases towards the output layer 
(Table[T|). There are 1.34 to 12.11 million free parameters (or weights, or synapses). 
We use standard on-line BP (e.g. |Russell & NorvTgl |2002[ pages 744-748), without 



momentum, but with a variable learning rate that shrinks by a multiplicative constant af- 
ter each epoch, from 10~ 3 down to 10~ 6 . Weights are initialized with a uniform random 
distribution in [—0.05, 0.05]. Each neuron's activation function is a scaled hyperbolic 



tangent: y(a) = At&nhBa, where A = 1.7159 and B = 0.6666 (LeCun et al., 1998). 
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4 Deforming images to get more training instances 



So far, the best results on MNIST were obtained by deforming training images, thus 
greatly increasing their number. This allows for training networks with many weights, 
making them insensitive to in-class variability. We combine affine (rotation, scaling 
and horizontal shearing) and elastic deformations, characterized by the following real- 
valued parameters: 

• a and a: for elastic distortions emulating uncontrolled oscillations of hand mus- 
cles (for details see |Simard et al4[2003| ); 



• (3: a random angle from [—/3,+/3] describes either rotation or horizontal shearing. 
In case of shearing, tan (3 defines the ratio between horizontal displacement and 
image height; 

• lx, J y - for horizontal and vertical scaling, randomly selected from [1 — 7/IOO, 1 + 
7/100]. 

At the beginning of every epoch the entire MNIST training set gets deformed. Initial 
experiments with small networks suggested the following deformation parameters: a = 
5.0 — 6.0, a = 36.0 — 38.0, 7 = 15 — 20. Since digits 1 and 7 are similar they get 
rotated/sheared less (/3 = 7.5°) than other digits (fi = 15.0°). 



5 Results 

All simulations were performed on a computer with a Core2 Quad 9450 2.66GHz pro- 
cessor, 3GB of RAM, and a GTX280 graphics card. The GPU accelerates the defor- 
mation routine by a factor of 10 (only elastic deformations are GPU-optimized); the 
forward propagation (FP) and BP routines are sped up by a factor of 40. Implementa- 
tion details can be found in the Appendix. We pick the trained MLP with the lowest 
validation error, and evaluate it on the MNIST test set. Results are summarized in 
Table ED 

Most remarkably, the best network has an error rate of only 0.35% (35 out of 10,000 
digits). This is significantly better than the best previously published results, namely, 



0.39% by Ranzato et al. (2006) and 0.40% by Simard et al. (2003), both obtained by 



more complex methods. The 35 misclassified digits are shown in Figure [T] Many of 
them are ambiguous and/or uncharacteristic, with obviously missing parts or strange 
strokes etc. Interestingly, the second guess of the network is correct for 30 out of the 35 
misclassified digits. 

The best test error of this MLP is even lower (0.32%) and may be viewed as the 
maximum capacity of the network. Performance clearly profits from adding hidden 
layers and more units per layer. For example, network 5 has more but smaller hidden 
layers than network 4 (Table [j}. 

Networks with up to 12 million weights can successfully be trained by plain gradient 
descent to achieve test errors below 1% after 20-30 epochs in less than 2 hours of 
training. How can networks with so many parameters generalize well on the unseen 
test set? Answer: the continual deformations of the training set generate a virtually 
infinite supply of training examples, and the network rarely sees any training image 
twice. 



Table 1 : Error rates on MNIST test set. 



ID 


architecture 


test error for 


best test 


simulation 


weights 




(number of neurons in each layer) 


best validation [%] 


error [%] 


time [h] 


[mi lions] 


1 


1000, 500, 10 


0.49 


0.44 


23.4 


1.34 


2 


1500, 1000, 500, 10 


0.46 


0.40 


44.2 


3.26 


3 


2000, 1500, 1000, 500, 10 


0.41 


0.39 


66.7 


6.69 


4 


2500, 2000, 1500, 1000, 500, 10 


0.35 


0.32 


114.5 


12.11 


5 


9 x 1000, 10 


0.44 


0.43 


107.7 


8.86 



Conclusion 

In recent decades the amount of raw computing power per Euro has grown by a factor 
of 100-1000 per decade. Our results show that this ongoing hardware progress may be 
more important than advances in algorithms and software (although the future will be- 
long to methods combining the best of both worlds). Current graphics cards (GPUs) are 
already more than 40 times faster than standard microprocessors when it comes to train- 
ing big and deep neural networks by the ancient algorithm, on-line back-propagation 
(weight update rate up to 5 x 10 9 /s, and more than 10 15 per trained network). On 
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Figure 1: The 35 miss-classified digits of the best network from Table [T] together with 
the two most likely predictions (bottom, from left to right) and the correct label accord- 
ing to MNIST (top, right). 

the very competitive MNIST handwriting benchmark, single precision floating-point 
GPU-based neural nets surpass all previously reported results, including those obtained 
by much more complex methods involving specialized architectures, unsupervised pre- 
training, combinations of machine learning classifiers etc. Training sets of sufficient 
size are obtained by appropriately deforming images. Of course, the approach is not 
limited to handwriting, and obviously holds great promise for many visual and other 
pattern recognition problems. 
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Appendix - GPU implementation 

Graphics Processing Unit 

Until 2007 the only way to program a GPU was to translate the problem- solving al- 
gorithm into a set of graphical operations. Despite being hard to code and difficult to 
debug, several GPU-based NN implementations were developed when GPUs became 
faster than CPUs. Two layer MLPs qSteinkraus eTaLj [2003] ) and CNNs ( |Chellapilla 



et al.[ |2005[ ) have been previously implemented on GPUs. Although speedups were 
relatively modest, these studies showed how GPUs can be used for machine learning. 
More recent GPU-based CNNs trained in batch mode are two orders of magnitude faster 



than CPU-based CNNs dScherer & Behnke, 2009). 



In 2007, NVIDIA developed the first version of CUDA (Compute Unified Device 
Architecture), a C-like general programming language. GPU speed and memory band- 
width are vastly superior to those of CPUs, and crucial for fast MLP implementations. 
To fully understand our algorithm in terms of GPU / CUDA, please visit the NVIDIA 



website (NVIDIA, 2009). According to CUDA terminology, the CPU is called host and 



the graphics card device or GPU. 



Deformations 



It takes 93 CPU seconds to deform the 60,000 MNIST training images, most of them 
(87) for elastic distortions. Only the most time-consuming part of the latter - convolu- 
tion with a gaussian kernel ( |Simard et aL| |2003| ) - is ported to the GPU. The MNIST 



training set is split into 600 sequentially processed batches. MNIST digits are scaled 
from the original 28 x 28 pixels to 29 x 29 pixels, to get a proper center, which simpli- 
fies convolution. An image grid has 290 x 290 cells, zero-padded to 300 x 300, thus 
avoiding margin effects when applying a gaussian convolution kernel of size 21 x 21. 
Our GPU program groups many threads into a block, where they share the same gaus- 
sian kernel and parts of the random field. The blocks contain 21 (the kernel size) x 10 
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Listing 1: Convolution Kernel for elastic distortion. 



__global__ void ConvolveField_optimized(float *randomfield, int width, int height, float *kernel, float *outputfield, float elasticScale){ 
float sum=0; 

const int stride_k=GET_STRIDE(GAUSSIAN_FIELD_SIZE,pitch_x>>2); //stride for gaussian kernel 

-shared- float K[GAUSSIAN.FIELD_SIZE][stride.k]; //kernel (21 x 32 values) 

..shared., float R[GAUSSIAN.FIELD.SIZE+9][GAUSSIAN.FIELD_SIZE]; //random field (30 x 21 values) 

..shared., float s[10][GAUSSIAN_FIELD_SIZE]; //partial sums (10 x 21 values) 

int strideJn=GET_STRIDE(width,pitch_x>>2); //random field stride as a multiple of 32 

int stride_out=GET_STRIDE(width — GAUSSIAN .FIELD.SIZE+1 ,pitch_x>>2); //output stride as a multiple of 32 

//loading gaussian kernel into K (21 x 21 values) 

K[ 0+threadidx,y][threadldx.x] = kernel[( 0+threadldx.y)*stride.k + threadldx.x];//rows 0..9 
K[1 0+threadldx,y][threadidx.x] = kernel[(1 0+threadldx.y)*stride_k + threadldx.x];//rows 10.. 19 
if(threadidx.y==0) 

K[20+threadldx.y][threadldx.x] = kernel[(20+threadldx.y)*stride.k + threadldx.x];//row 20 

//loading randomfield into R 
//0..9 x 21 values 

R[ 0+threadldx.y][threadldx.x] = randomfield[(10*blockldx.y+ 0+threadldx.y}*strideJn + blockldx.x + threadldx.x]; 

//10..19 x 21 values 

R[10+threadldx.y][threadldx.x] = randomfield[(10*blockldx.y+10+threadldx.y)*stride_in + blockldx.x + threadldx.x]; 

//20..29 x 21 values 

R[20+threadldx.y][threadldx.x] = randomfield[(10*blockldx.y+20+threadldx.y)*stride.in + blockldx.x + threadldx.x]; 
__ syncthreads(); //wait until everything is read into shared memory 

//computing partial sums 

tpragma unroll 21 //GAUSSIAN.FIELD.SIZE 

for(inti=0;i<GAUSSIAN_FIELD_SIZE;i++) 

sum += R[threadldx.y + i][threadldx.x] * K[i][threadldx.x]; 
s[threadldx.y][threadldx.x]=sum; 
„ syncthreadsQ; 

if(threadldx.x==0){ //the first column of threads compute the final values of the convolutions 
#pragma unroll 20//GAUSSIAN.FIELD.SIZE — 1 
forfint i=1 ;i<GAUSSIAN.FIELD.SIZE;i++) sum+=s[threadldx.y][i]; 
outputfield[(blockldx.y*10+threadldx.y)*stride_out + blockldx.x] = sum * elasticScale; 

} 

} 



threads, each computing a vertical strip of the convolution operation (Listing [T]). 

Generating the elastic displacement field takes only 3 seconds. Deforming the whole 
training set is more than 10 times faster, taking 9 instead of the original 93 seconds. Fur- 
ther optimization would be possible by porting all deformations onto the GPU, and by 
using the hardware's interpolation capabilities to perform the final bilinear interpola- 
tion. We omitted this since deformations are already pretty fast (deforming all images 
of one epoch takes only 5-15 % of total computation time, depending on MLP size). 
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Training algorithm 



We closely follow the standard BP algorithm (e.g. Russell & Norvig, 2002[ pages 744- 
748), except that BP of deltas and weight updates are disentangled and performed se- 
quentially. This allows for more parallelism within each routine. 



Forward propagation 

The algorithm is divided into two kernels. The weight matrix W is partitioned as illus- 
trated in Figure [2} 



^256 threads^ 



OlA A 



32 threads 



w 



c) 



Figure 2: Forward propagation: a) mapping of kernel 1 grid onto the padded weight 
matrix; b) mapping the kernel 2 grid onto the partial dot products matrix; c) output of 
forward propagation. 

Kernel 1 

Each block has 256 threads (Figure [2^), each computing a partial dot product of 
32 component vectors. The dot products are stored in a temporary matrix (Figure [2})). 
This kernel has a very high throughput: average memory bandwidth is 115GB/s. This 
is possible because many relatively small blocks keep the GPU busy. Each block uses 
shared memory for storing the previous layer activations, which are simultaneously 
read by the first 32 threads of each block and then used by all 256 threads. After 
thread synchronization, the partial dot products are computed in parallel (Listing [2]). 
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The number of instructions is kept to a minimum by pre-computing all common index 
parts. 

Listing 2: Forward propagation kernels. 

__global__ void MLP_FP_reduction_Kernel1 (float *prevl_N, float *W, float * partial sum, unsigned int neurons, unsigned int prevneurons){ 
const int threads=256; 

const int stride=GET_STRIDE(neurons,pitch_x>>2); //horizontal stride of W matrix 

int X=blockldx.x*threads + threadldx.x; //precomputing expressions 

int Y=X+stride*blockldx.y; 

int Z=blockldx.y*pitch_y*stride + X; 

float sum=0. Of; 

..shared- float output[pitch_y]; 
if(blockldx.y==0) 

if(threadldx.x==0) output[0]=1 .Of; 

else if(threadldx.x<pitch_y) //there are only 32 values to read and 128 threads 

output[threadldx.x] = threadldx.x — 1 <prevneurons ? prevLN [threadldx.x— 1] : O.Of; 
else; 

else if(threadldx.x<pitch_y) //there are only 32 values to read and 128 threads 

output[threadldx.x] = blockldx.y*pitch_y+threadldx.x — 1 <prevneurons ? 

prevLN[blockldx.y*pitch_y+threadldx.x— 1] : O.Of; 

else; 
__syncthreads(); 

if(X<neurons){//compute partial sums 
//#pragma unroll 32 
int size=0; 

if((blockldx.y+1)*pitch_y>=prevneurons+1) 

size = prevneurons + 1 — blockldx.y*pitch_y; 
else size=pitch_y; 
for (int ic=0; ic<size; ic++){ 

sum += outputpc] * W[Z]; 

Z+=stride; 

> 

partialsum[Y]=sum; 

} 

} 

—global— void MLP_FP_reduction_Kernel2(float *currl_N, float *partialsum, unsigned int neurons, unsigned int size){ 
float sum=0. Of; 

int idx = blockldx.x*(pitch_x> >2) + threadldx.x; //precomputed index 

unsigned int stride = GET_STRIDE(neurons,pitch_x>>2); //stride for partialsum matrix 

if(idx>=neurons)return; //is this thread computing a true neuron? 

for (int i=0; i<size; i++) sum += partialsum[i*stride+idx]; //computing the final dot product 

currl_N[idx] = SIGMOIDF(sum); //applying activation 

} 



Kernel 2 

The thread grid (Figure [2^) has only one row of blocks consisting of warp threads, 
since each thread has to compute a complete dot product (Figure |2j:) and then pipe it 
into the activation function. This kernel (Listing [2]) is inefficient for layers with fewer 
than 1024 incoming connections per neuron, especially for the last layer which has only 
ten neurons, one for each digit. That is, its grid will have only one block, occupying 
only 3% of the GTX280 GPU. 
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Backward propagation 

This is similar to FP, but we need W T for coalesced access. Instead of transposing the 
matrix, the computations are performed on patches of data read from device mem- 
ory into shared memory, similar to the optimized matrix transposition algorithm of 



Ruetsch & Micikevicius (2009). Shared memory access is much faster, without co- 



alescing restrictions. Because we have to cope with layers of thousands of neurons, 
back-propagating deltas uses a reduction method implemented in two kernels commu- 
nicating partial results via global memory (Listing [3]). 

Kernel 1 The bi-dimensional grid is divided into blocks of warp (32) threads. The 
kernel starts by reading a patch of 32 x 32 values from W. The stride of the shared 
memory block is 33 (warp + 1), thus avoiding all bank conflicts and significantly im- 
proving speed. Next, 32 input delta values are read and all memory locations that do 
not correspond to real neurons (because of vertical striding) are zero-padded to avoid 
branching in subsequent computations. The number of elements is fixed to warp size, 
and the computing loop is unrolled for further speedups. Before finishing, each thread 
writes its own partial dot product to global memory. 

Kernel 2 

This kernel completes BP of deltas by summing up partial deltas computed by the 
previous kernel. It multiplies the final result by the derivative of the activation function 
applied to the current neuron's state, and writes the new delta to global memory. 



Weight updating 

The algorithm (Listing [4]) starts by reading the appropriate delta, and pre-computes 
all repetitive expressions. Then the first 16 threads read the states from global mem- 
ory into shared memory. The "bias neuron" with constant activation 1.0 is dealt with 
by conditional statements, which could be avoided through expressions containing the 
conditions. Once threads are synchronized, each single thread updates 16 weights in a 
fixed unrolled loop. 
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Listing 3: Backpropagating deltas kernels. 



__global__ void backPropagateDeltasFC_s2_A(float *indelta, float ^weights, unsigned int neon, unsigned int nrneur, float *partial){ 
const int px = pitch_x>>2; 
unsigned int stride_x = GET_STRIDE(nrneur,px); 
unsigned int stride.y = GET_STRIDE(ncon,pitch_y); 
float outd = 0.0; 

int idx = blockldx.x*px+threadldx.x; 

int X = blockldx.y*pitch_y*stride_x + idx; 

int Y - threadldx.x; 

__shared- float w[32*33]; //pitch_y and px should be equal I +1 to avoid bank conflict! 
..shared- float id[px]; //input delta 

#pragma unroll 32 //read the weight patch in shared memory 
for(int i=0;i<pitch_y;i++){w[Y]=weights[X]; X+=stride_x; Y+=33;} 
//read the input delta patch in shared memory 

if{idx>=nrneur) id[threadldx.x]=0; //a fake input delta for inexistent indelta 
else id[threadldx.x]=indelta[idx]; 

__syncthreads(); //not needed for block with warp number of threads: implicit synchronization 

#pragma unroll 32 //compute partial results 

for(int i=0;i<px;i++) outd+=w[threadldx.x*33+i]*id[i]; 

//write out the partial results 

partial[blockldx.x*stride_y + blockldx.y*pitch_y + threadldx.x] = outd; 

} 

..global— void backPropagateDeltasFC_s2_B(float *outdelta,float *instates, unsigned int neon, unsigned int nrneur, float *partial){ 
int px=pitch_x>>2; 

unsigned int stride_x = GET_STRIDE(nrneur,px); 

unsigned int stride.y = GET_STRIDE(ncon,pitch_y); 

float outd = 0.0; 

int size=stride_x/px; 

int idx=blockldx.x*pitch_y+threadldx.x; 

if(idx==0); //true only for block and thread 

else{ 

for(int i=0;i<size;i++) 

outd+=partial[i*stride.y + idx]; 
outdelta[idx-1] = outd * DSIGMOIDF(instates[idx-1]); //-1 BIAS ... 

} 

} 



Listing 4: Weights adjustment kernel. 

..global— void adjustWeightsFC_s1 (float *states, float *deltas, float *weights, float eta, unsigned int neon, unsigned int nrneur){ 
const int pitch_y=16; 
const int threads=256; 
unsigned int px = pitch_x > > 2; 
unsigned int stride_x = GET_STRIDE(nrneur,px); 
float etadeltak = eta*deltas[blockldx.x*threads+threadldx.x],t; 
int b=blockldx.y*stride_x*pitch_y + threads*blockldx.x + threadldx.x; 
..shared- float st[pitch_y]; //for states 
intcondl = blockldx.y || threadldx.x; 
int cond2 = (blockldx.y+1 )*pitch_y<=ncon; 
int size = cond2 * pitch.y + !cond2 * (ncon%pitch_y); 

if(threadldx.x<pitch_y) st[threadldx.x] = condl * states[blockldx.y*pitch_y + threadldx.x — 1 ] + !cond1 ; 
__syncthreads(); 

if (blockldx.x*threads + threadldx.x < nrneur){ 
#pragma unroll 16 
for (intj=0;j<16;j++){ 

t=weights[b]; 

t-= etadeltak * stfl]; 

weights[b]=t; 

b+=stride_x;}} 
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